Posts with tag Digital Humanities Now Editors' Choice
← Back to all posts
Now back to some texts for a bit. Last spring, I posted a few times about the possibilities for reading genders in large collections of books. I didn’t follow up because I have some concerns about just what to do with this sort of pronoun data. But after talking about it to Ryan Cordell’s class at Northeastern last week, I wanted to think a little bit more about the representation of male and female subjects in late-19th century texts. Further spurs were Matt Jockers recently posted the pronoun usage in his corpus of novels; Jeana Jorgensen pointed to recent research by Kathleen Ragan that suggests that editorial and teller effects have a massive effect on the gender of protagonists in folk tales. Bookworm gives a great platform for looking at this sort of question.
Following up on my previous topic modeling post, I want to talk about one thing humanists actually do with topic models once they build them, most of the time: chart the topics over time. Since I think that, although Topic Modeling can be very useful, there’s too little skepticism about the technique, I’m venturing to provide it (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to ’topics’ in topic modeling as stable abstractions, and argue for a much greater attention to the granular words that make up a topic model.
Note: this post is part 5 of my series on whaling logs and digital history. For the full overview, click here.
Note: this post is part 4, section 2 of my series on whaling logs and digital history. For the full overview, click here.
It’s pretty obvious that one of the many problems in studying history by relying on the print record is that writers of books are disproportionately male.
Though I usually work with the Bookworm database of Open Library texts, I’ve been playing a bit more with the Google Ngram data sets lately, which have substantial advantages in size, quality, and time period. Largely I use it to check or search for patterns I can then analyze in detail with text-length data; but there’s also a lot more that could be coming out of the Ngrams set than what I’ve seen in the last year.
[This is not what I’ll be saying at the AHA on Sunday morning, since I’m participating in a panel discussion with Stefan Sinclair, Tim Sherrat, and Fred Gibbs, chaired by Bill Turkel. Do come! But if I were to toss something off today to show how text mining can contribute to historical questions and what sort of issues we can answer, now, using simple tools and big data, this might be the story I’d start with to show how much data we have, and how little things can have different meanings at big scales…]
When data exploration produces Christmas-themed charts, that’s a sign it’s time to post again. So here’s a chart and a problem.
Ted Underwood has been talking up the advantages of the Mann-Whitney test over Dunning’s Log-likelihood, which is currently more widely used. I’m having trouble getting M-W running on large numbers of texts as quickly as I’d like, but I’d say that his basic contention–that Dunning log-likelihood is frequently not the best method–is definitely true, and there’s a lot to like about rank-ordering tests.
Historians often hope that digitized texts will enable better, faster comparisons of groups of texts. Now that at least the 1grams on Bookworm are running pretty smoothly, I want to start to lay the groundwork for using corpus comparisons to look at words in a big digital library. For the algorithmically minded: this post should act as a somewhat idiosyncratic approach to Dunning’s Log-likelihood statistic. For the hermeneutically minded: this post should explain why you might need _any_ log-likelihood statistic.